posted 01-13-2009 03:25 AM
Skipwebb (1/8/2009):
quote:
Bob, on those 100 charts...if the charts score as non-deceptive and the ground truth is deceptive, is the examiner wrong on his call, or did the person use some type of countermeasures? Conversely, if the examiner scores them as deceptive and the examinee was actually non-deceptive, was the examiner wrong, or was the examinee just a false positive who scored deceptive and was not?
My concern, as you might guess, is that I don't see how scoring the 100 ground truth charts tells us much about whether or not the person scoring them can conduct a polygraph and might be misleading about the examiner's ability to score charts.
I guess if the charts were scored by a large group of examiners and a clear majority called a set deceptive rather than non-deceptive, opposite the actual results, then are all the examiners wrong, or are the charts (and therefore the polygraph) wrong?
Also, is the requirement to get the correct condition (DI/NDI/NO)? If you call a set NO, is that an error or a no call?
I recently had a discussion with one of our QC guys about a "0" call on a component. He said, "I just don't see anything there, so I gave it a '0'." I asked him: if we were both standing out in a field and I saw a deer across the field and remarked to him, "Look at that pretty deer," and he replied, "I don't see it," does that mean the deer is not there, or does it mean he just couldn't see well enough to see the deer?
Excellent points on observer variability. It begins with the age-old "if a tree falls in a forest..." and gets a little more complicated from there. But the question of scoring competency is a manageable one.
The approach so far has been to use that sample of 100 confirmed cases, have the examiner score 'em, and see how they did.
As polygraph examiners we get a little riled about whether or not NO is an error. To a scientist, the correct answer is that we should be concerned about it both ways.
Sensitivity is measured as the proportion of deceptive cases that are correctly determined - in which a deceptive conclusion was reached. Both FN and INC decisions hurt the sensitivity level of any kind of test (polygraph or any other test).
Specificity is measured as the proportion of truthful cases that are correctly determined - in which a truthful conclusion was rendered. Both FP and INC decisions hurt the specificity level of any kind of test.
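For anyone who wants the arithmetic spelled out, here's a quick sketch in Python (the counts are made up, not from any real sample):

```python
# Hypothetical counts, for illustration only.

def sensitivity(correct_di, fn, inc_deceptive):
    """Proportion of deceptive cases in which a deceptive call was made.
    INC calls count against sensitivity, the same as false negatives."""
    return correct_di / (correct_di + fn + inc_deceptive)

def specificity(correct_ndi, fp, inc_truthful):
    """Proportion of truthful cases in which a truthful call was made.
    INC calls count against specificity, the same as false positives."""
    return correct_ndi / (correct_ndi + fp + inc_truthful)

# Example: 50 confirmed deceptive and 50 confirmed truthful cases.
print(sensitivity(correct_di=42, fn=3, inc_deceptive=5))   # 0.84
print(specificity(correct_ndi=38, fp=5, inc_truthful=7))   # 0.76
```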
Decision accuracy is another matter, and should be calculated without INCs for truthful and deceptive cases, because that allows us to understand the results more easily, and more easily generalize the results to field settings with varying and unknown base-rates.
Here a complication exists, in whether to report decision accuracy as the simple proportion of correct decisions within the subsets of truthful and deceptive cases, or as the inverse of the FN and FP rates, which is a cross-subset calculation. There are reasons, in terms of generalizable knowledge, for doing it both ways. FN and FP rates are non-resistant to base-rates, and therefore difficult to generalize to settings with different base-rates. The within-subset method is resistant and more easily generalized, but does not tell us much about real-life error rates.
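With the same made-up counts, the two ways of reporting decision accuracy look like this (I'm reading "the inverse of the FN and FP rates" as the proportion of DI and NDI calls that turn out to be correct, which is the base-rate-dependent version):

```python
# Hypothetical counts, INCs already excluded.
tp, fn = 42, 3   # deceptive cases: correct DI calls, false negatives
tn, fp = 38, 5   # truthful cases: correct NDI calls, false positives

# Within-subset decision accuracy (resistant to base-rates):
acc_deceptive = tp / (tp + fn)   # 0.933 of deceptive cases decided correctly
acc_truthful  = tn / (tn + fp)   # 0.884 of truthful cases decided correctly

# Cross-subset version (not resistant to base-rates):
correct_among_di_calls  = tp / (tp + fp)   # 0.894 = 1 - FP rate among DI calls
correct_among_ndi_calls = tn / (tn + fn)   # 0.927 = 1 - FN rate among NDI calls
```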
It is possible for a test to have poor sensitivity and/or specificity, and still have high levels of decision accuracy. The Army MGQT, for example, can have good sensitivity to deception and weak specificity to truthfulness. However, when someone does pass the MGQT, the likelihood of a correct decision is very high. Fingerprint testing is another example of high accuracy with weak sensitivity (high inconclusives).
Look in the lead article of the recent journal and you'll see that Senter et al. provided tables of the results using both methods: frequency and proportions of DI, NDI, and INC calls as a function of ground truth, and decision accuracy as a function of ground truth.
You'll also notice that Senter et al. reported average (unweighted) overall accuracy of the combined truthful and deceptive subsamples. Why, you ask, do they use unweighted accuracy? Because it allows us to more easily generalize and imagine how the test would perform in unbiased field settings (in which the external probability of involvement is at or near .5). Their sample included n=33 in one group and n=36 in the other. This is close enough for jazz, but there are times when the groups or base-rate are much further from chance. Simply showing the overall accuracy of all cases would be asymptotically (big word) equivalent to the weighted average of the combined subsets.
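Putting some made-up accuracy figures to the weighted/unweighted distinction (only the group sizes come from Senter et al., and which group is which doesn't matter for the point):

```python
# Group sizes from Senter et al.; the subsample accuracies are hypothetical.
n_deceptive, n_truthful = 36, 33
acc_deceptive, acc_truthful = 0.90, 0.80

# Unweighted average treats both conditions equally, as if the base-rate were .5.
unweighted = (acc_deceptive + acc_truthful) / 2

# Weighted average is what you get by pooling all cases, so it drifts toward
# the larger group whenever the split departs from 50/50.
weighted = (n_deceptive * acc_deceptive + n_truthful * acc_truthful) / (n_deceptive + n_truthful)

print(round(unweighted, 3), round(weighted, 3))   # 0.85 0.852 - nearly identical here
```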
A quick look at Senter et al. indicates the purpose was to investigate the validity of the AFMGQT, but a more careful look reveals that part of the objective of the study was to validate a laboratory scenario for other validation studies.
Which brings us to the point (which is Skip's point, I think.)
Designing suitable protocols to answer research questions is not always the same as designing suitable protocols to answer questions about individual competency.
Science and validation are sometimes a matter of first triangulating a known position, else we're lost in a wilderness of unknowns, and then venturing out into further unknown territory while maintaining a known attitude or orientation to that known position. The point isn't to be in unknown territory, but to make unknown territory into known territory. This is often accomplished only incrementally.
A research sample should be a random selection of confirmed cases, representative of the population and issue that we wish to study: in this case, truthful and deceptive persons. It sometimes occurs that intervening and interfering variables complicate the process. Interfering variables might be countermeasures, or simply the error rate of about 10%. Error cases express the natural variability inherent in the underlying scientific constructs, which means that even when the method is executed correctly, a case will sometimes be classified incorrectly. Error cases are a fact of life, so we have to leave them in our research sample, or the sample is manipulated and not representative.
A certification or proficiency test is a different animal. The objective is not to determine the validity of the polygraph itself, but to determine the skill or training competency of an individual. It is therefore incorrect to attempt to use error cases in a competency test, because they mask the objective: measuring the individual's execution of a learned skill. If we have an error case, for which a correct execution of the skill yields an incorrect classification, then the case should be removed from the competency test, because it does not correctly reflect the acquisition or execution of the skill of interest.
One way to correctly use an error case in a competency test is to score the item inversely. This also presents an opportunity for an internal validity measure, meaning that we expect a competent scorer to make an error on an error case. It is not possible to get a perfect score, and anyone who gets a perfect score will have some 'splainin to do.
Another solution would be to remove error cases from a certification test, keeping only those cases for which correct execution of the scoring skill would produce the correct classification. This does not completely solve the problem of variability.
Variability is expected among human scorers. Our task is to understand the variability of scores, including the location and shape of the distribution of scores.
We could study the distribution of scores, or decisions, for a sample of cases as a whole, but that is not the most informative design.
It would make more sense to construct a scoring proficiency test of individual segments that demand the execution of the expected skills pertaining to the respiration, electrodermal, and cardiovascular data. If we are interested in how well an examiner executes the expected skill, all that is necessary to construct a validated test is to have a group of accepted experts score the test to determine normative distributions of scores for each segment. Some segments will undoubtedly be less reliable than others, and those should be removed from the test. After selecting for items with usable interrater reliability, the test can be validated on a second group of experts.
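As a rough sketch of that screening step (the agreement statistic, the cutoff, and the scores here are my own stand-ins; nothing in particular is prescribed):

```python
from statistics import mean, stdev

# Hypothetical expert spot scores (-1/0/+1) for each candidate segment.
expert_scores = {
    "segment_01": [+1, +1, +1, 0, +1],
    "segment_02": [-1, +1, 0, -1, +1],   # experts disagree; likely dropped
    "segment_03": [-1, -1, -1, -1, 0],
}

MAX_SPREAD = 0.6   # arbitrary cutoff for "usable" interrater agreement

# Keep only segments where the experts' scores cluster tightly, then record
# the expert mean as the segment's normative value.
retained = {name: s for name, s in expert_scores.items() if stdev(s) <= MAX_SPREAD}
normative_means = {name: mean(s) for name, s in retained.items()}
print(normative_means)   # segment_02 is screened out in this example
```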
Where, you ask, would we get those experts for the normative sample, and how would we select them? One idea would be to ask the participants in the two Marin/paired-testing studies to score the segments for the normative and validation tasks, selecting only those scorers whose skill satisfied the requirements of the protocol (86% correct, <20% INC). In this way, we scaffold on the findings of the NAS and the paired-testing research. Then we use the skills (not scores or results) of verified experts to anchor our assessment of the skill of others. It is common in tests of this type to accept a test-taker's skill as adequate if they are within one or two standard errors (not standard deviations; I think it's late) of the normative expert sample.
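Something like the following, if we read "standard errors" as the standard error of the expert mean for each segment (the scores below are made up):

```python
from math import sqrt
from statistics import mean, stdev

def within_norm(candidate_score, expert_scores, k=2):
    """True if the candidate's score for a segment falls within k standard
    errors of the expert normative mean for that segment."""
    se = stdev(expert_scores) / sqrt(len(expert_scores))   # standard error of the mean
    return abs(candidate_score - mean(expert_scores)) <= k * se

# Hypothetical expert panel scores for one segment.
expert_panel = [+1, +1, 0, +1, +1, +1, 0, +1]
print(within_norm(+1, expert_panel))   # True  - within two standard errors
print(within_norm(-1, expert_panel))   # False - well outside the expert norm
```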
The nice thing about a proficiency test such as this: it can be constructed, normed, reconstructed, and re-normed without depending on a continual supply of verified exams.
Niters,
.02